Code Plateau Fellowship; Data Science Track Competition¶

Context:¶

The Tanzanian tourism sector plays a significant role in the Tanzanian economy, contributing about 17% to the country’s GDP and 25% of all foreign exchange revenues. The sector, which provides direct employment for more than 600,000 people and up to 2 million people indirectly, generated approximately $2.4 billion in 2018 according to government statistics. Tanzania received a record 1.1 million international visitor arrivals in 2014, mostly from Europe, the US and Africa. Tanzania is the only country in the world which has allocated more than 25% of its total area for wildlife, national parks, and protected areas. There are 16 national parks in Tanzania, 28 game reserves, 44 game-controlled areas, two marine parks and one conservation area.

Objective:¶

The objective of this competition is to explore the data and build a linear regression model that predicts the spending behavior of tourists visiting Tanzania. Tour operators and the Tanzania Tourism Board can use the model to help tourists across the world estimate their expenditure before visiting Tanzania.

Data Description¶

The dataset contains 6476 rows of up-to-date information on tourist expenditure collected by the National Bureau of Statistics (NBS) in Tanzania. The dataset was collected to gain a better understanding of the status of the tourism sector and to provide an instrument that will enable sector growth. The survey covers seven departure points, namely: Julius Nyerere International Airport, Kilimanjaro International Airport, Abeid Amani Karume International Airport, and the Namanga, Tunduma, Mtukula and Manyovu border points.

Importing necessary libraries and data¶

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline
from pandas_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
In [2]:
# Loading data
df=pd.read_csv('Train .csv')
df_test=pd.read_csv('Test .csv')
Final_df= df_test.copy()

Data Overview¶

  • Observations
  • Sanity checks
In [3]:
# Data Preview
df.head()
Out[3]:
ID country age_group travel_with total_female total_male purpose main_activity info_source tour_arrangement ... package_transport_tz package_sightseeing package_guided_tour package_insurance night_mainland night_zanzibar payment_mode first_trip_tz most_impressing total_cost
0 tour_0 SWIZERLAND 45-64 Friends/Relatives 1.0 1.0 Leisure and Holidays Wildlife tourism Friends, relatives Independent ... No No No No 13.0 0.0 Cash No Friendly People 674602.5
1 tour_10 UNITED KINGDOM 25-44 NaN 1.0 0.0 Leisure and Holidays Cultural tourism others Independent ... No No No No 14.0 7.0 Cash Yes Wonderful Country, Landscape, Nature 3214906.5
2 tour_1000 UNITED KINGDOM 25-44 Alone 0.0 1.0 Visiting Friends and Relatives Cultural tourism Friends, relatives Independent ... No No No No 1.0 31.0 Cash No Excellent Experience 3315000.0
3 tour_1002 UNITED KINGDOM 25-44 Spouse 1.0 1.0 Leisure and Holidays Wildlife tourism Travel, agent, tour operator Package Tour ... Yes Yes Yes No 11.0 0.0 Cash Yes Friendly People 7790250.0
4 tour_1004 CHINA 1-24 NaN 1.0 0.0 Leisure and Holidays Wildlife tourism Travel, agent, tour operator Independent ... No No No No 7.0 4.0 Cash Yes No comments 1657500.0

5 rows × 23 columns

In [4]:
df.duplicated().sum()
Out[4]:
0
In [5]:
df.isnull().sum()
Out[5]:
ID                          0
country                     0
age_group                   0
travel_with              1114
total_female                3
total_male                  5
purpose                     0
main_activity               0
info_source                 0
tour_arrangement            0
package_transport_int       0
package_accomodation        0
package_food                0
package_transport_tz        0
package_sightseeing         0
package_guided_tour         0
package_insurance           0
night_mainland              0
night_zanzibar              0
payment_mode                0
first_trip_tz               0
most_impressing           313
total_cost                  0
dtype: int64
In [6]:
df_test.duplicated().sum()
Out[6]:
0
In [7]:
df_test.isnull().sum()
Out[7]:
ID                         0
country                    0
age_group                  0
travel_with              327
total_female               1
total_male                 2
purpose                    0
main_activity              0
info_source                0
tour_arrangement           0
package_transport_int      0
package_accomodation       0
package_food               0
package_transport_tz       0
package_sightseeing        0
package_guided_tour        0
package_insurance          0
night_mainland             0
night_zanzibar             0
payment_mode               0
first_trip_tz              0
most_impressing          111
dtype: int64

Treating Missing Values¶

In [8]:
# For travel_with, I filled missing values with the most frequent option, "Alone"
df['travel_with'] = df['travel_with'].fillna('Alone')

# For most_impressing, the most frequent option is "Friendly People"
df['most_impressing'] = df['most_impressing'].fillna('Friendly People')

# For the female and male columns, fill each with its own mode
df['total_female'] = df['total_female'].fillna(df['total_female'].mode()[0])

df['total_male'] = df['total_male'].fillna(df['total_male'].mode()[0])

Treating Missing Values on Test Dataset¶

In [9]:
# For travel_with, I filled missing values with the most frequent option, "Alone"
df_test['travel_with'] = df_test['travel_with'].fillna('Alone')

# For most_impressing, the most frequent option is "Friendly People"
df_test['most_impressing'] = df_test['most_impressing'].fillna('Friendly People')

# For the female and male columns, fill each with the mode of the same
# column in the training data, so no statistic is learned from the test set
df_test['total_female'] = df_test['total_female'].fillna(df['total_female'].mode()[0])

df_test['total_male'] = df_test['total_male'].fillna(df['total_male'].mode()[0])
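
The two imputation cells above repeat the same steps for the training and test frames. A minimal sketch of one way to factor this into helpers that learn the fill values from the training frame only and apply them to any frame. The function names `learn_fill_values` and `apply_fill_values` and the toy frames are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def learn_fill_values(train, columns):
    """Return the most frequent value of each column, computed on `train` only."""
    return {col: train[col].mode()[0] for col in columns}


def apply_fill_values(frame, fill_values):
    """Return a copy of `frame` with missing values replaced column by column."""
    return frame.fillna(value=fill_values)


# Toy stand-ins for the training and test frames (not the survey data)
toy_train = pd.DataFrame({
    "travel_with": ["Alone", "Spouse", "Alone", np.nan],
    "total_male": [1.0, 1.0, np.nan, 2.0],
})
toy_test = pd.DataFrame({
    "travel_with": [np.nan, "Children"],
    "total_male": [np.nan, 0.0],
})

fills = learn_fill_values(toy_train, ["travel_with", "total_male"])
toy_train_filled = apply_fill_values(toy_train, fills)
toy_test_filled = apply_fill_values(toy_test, fills)
print(fills)
```

Keeping the learned values in one dictionary makes it harder to fill a column with another column's statistic by accident.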

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.
In [10]:
# Descriptive Summary
df.describe()
Out[10]:
total_female total_male night_mainland night_zanzibar total_cost
count 4809.000000 4809.000000 4809.000000 4809.000000 4.809000e+03
mean 0.926804 1.009565 8.488043 2.304429 8.114389e+06
std 1.287841 1.138273 10.427624 4.227080 1.222490e+07
min 0.000000 0.000000 0.000000 0.000000 4.900000e+04
25% 0.000000 1.000000 3.000000 0.000000 8.121750e+05
50% 1.000000 1.000000 6.000000 0.000000 3.397875e+06
75% 1.000000 1.000000 11.000000 4.000000 9.945000e+06
max 49.000000 44.000000 145.000000 61.000000 9.953288e+07
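
In the summary above, the mean of total_cost (about 8.1e6) is well above its median (about 3.4e6), which suggests a right-skewed target. A minimal sketch, on synthetic data rather than the survey data, of how the skew could be measured and reduced with a log transform. The lognormal parameters and the name `toy_cost` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for total_cost (not the survey data)
toy_cost = pd.Series(rng.lognormal(mean=15.0, sigma=1.2, size=2000), name="total_cost")

raw_skew = toy_cost.skew()
log_cost = np.log1p(toy_cost)   # log(1 + x), defined at zero as well
log_skew = log_cost.skew()
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")

# np.expm1 inverts np.log1p, so predictions made on the log scale
# can be mapped back to the original currency units
recovered = np.expm1(log_cost)
```

A linear model fitted to a log-transformed target cannot produce negative costs once its predictions are mapped back with `np.expm1`.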
In [11]:
plt.figure(figsize=(15,7))
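# Restrict to numeric columns; recent pandas versions raise an error otherwise
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="magma")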
plt.show()

Questions:

  1. What are the top 5 countries by total spending?

  2. Which age group spends the most, and which travel_with category spends the most overall?

  3. Which country has the highest-spending tourists?

  4. What is the average number of nights a tourist spends on the Tanzania mainland?

  5. What is the average number of nights a tourist spends in Zanzibar?

  6. What is the payment mode most preferred by tourists?

  7. Highlight the aspects of tourism that are most profitable and worth investing in.

  8. What is the most sought-after food among tourists?

Solution 1¶

In [12]:
dats=df.groupby(['country'], sort=False)['total_cost'].sum().reset_index()
#dats= df.groupby('country').agg({'total_cost':['sum','count']})
print(dats)
#dats.to_frame()
                      country    total_cost
0                  SWIZERLAND  7.078238e+08
1              UNITED KINGDOM  3.808383e+09
2                       CHINA  4.296282e+08
3                SOUTH AFRICA  2.594805e+09
4    UNITED STATES OF AMERICA  8.890832e+09
..                        ...           ...
100                   URUGUAY  1.657500e+05
101                   MORROCO  1.491750e+06
102                  THAILAND  1.408875e+06
103                   BERMUDA  2.000000e+05
104                   ESTONIA  2.817750e+06

[105 rows x 2 columns]
In [13]:
# To Find the top 5 countries with the highest spending statistic
top_country= dats.nlargest(5,['total_cost'])
top_country
Out[13]:
country total_cost
4 UNITED STATES OF AMERICA 8.890832e+09
1 UNITED KINGDOM 3.808383e+09
21 ITALY 3.762160e+09
20 FRANCE 3.344496e+09
30 AUSTRALIA 2.743132e+09
In [14]:
px.bar(top_country, x = 'country', y = 'total_cost', title = 'TOP 5 COUNTRIES WITH THE HIGHEST SPENDING', color_discrete_sequence = ['darkred'])
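
Summing total_cost per country favors countries that send many visitors. A minimal sketch, on a toy frame with hypothetical country labels, of reporting the total, the per-visitor mean, and the visitor count side by side so that the two rankings can be compared.

```python
import pandas as pd

# Toy data with hypothetical country labels (not the survey data)
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "C", "C"],
    "total_cost": [100.0, 200.0, 300.0, 500.0, 50.0, 150.0],
})

summary = (
    toy.groupby("country")["total_cost"]
       .agg(total="sum", mean="mean", visitors="count")
       .reset_index()
)

top_by_total = summary.nlargest(2, "total")   # driven by visitor numbers
top_by_mean = summary.nlargest(2, "mean")     # spending per visitor
print(summary)
```

In this toy frame, country A leads on total spending because it has three visitors, while country B leads on spending per visitor.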

Solution 2¶

Spending statistics by age_group¶

In [15]:
from plotnine import ggplot, aes, geom_boxplot, geom_bar, facet_wrap, theme, ggtitle
In [16]:
ggplot(df,aes(x='age_group',y='total_cost'))+ \
geom_boxplot(color='lightskyblue',fill=['c','g','y','r'])+ ggtitle("Age_group Total_cost boxplot")
Out[16]:
<ggplot: (118565977426)>

The boxplot above suggests that the 25-44 age group spends the most, followed by the 45-64 age group, while the 65+ group spends the least.

Spending statistics by travel_with¶

In [17]:
# The over all highest spenders by travel with
ggplot(df,aes(x='travel_with',y='total_cost'))+ \
geom_boxplot(colour="green",fill="lightskyblue")+ ggtitle("Travel_with Total_cost boxplot")
Out[17]:
<ggplot: (118578927330)>

The boxplot above suggests that tourists travelling with Friends/Relatives spent the most, followed by those travelling with Spouse and children.
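
Rankings read off a boxplot can be checked numerically. A minimal sketch of a helper that tabulates the quartiles of total_cost per group and sorts by the median. The helper name `median_spend_by` and the toy values are hypothetical.

```python
import pandas as pd


def median_spend_by(frame, group_col, value_col="total_cost"):
    """Quartiles of `value_col` per group, sorted with the highest median first."""
    stats = frame.groupby(group_col)[value_col].quantile([0.25, 0.5, 0.75]).unstack()
    stats.columns = ["q25", "median", "q75"]
    return stats.sort_values("median", ascending=False)


# Toy data (not the survey data)
toy = pd.DataFrame({
    "age_group": ["25-44", "25-44", "25-44", "65+", "65+", "65+"],
    "total_cost": [10.0, 20.0, 30.0, 1.0, 2.0, 3.0],
})

ranking = median_spend_by(toy, "age_group")
print(ranking)
```

The same helper could be called with `"travel_with"` as the grouping column.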

In [18]:
#com_dats=df.groupby(['age_group','travel_with'])['total_cost'].sum().nlargest().reset_index()
#print(com_dats)
In [19]:
#ggplot(com_dats,aes(x='age_group',y='total_cost',fill='travel_with'))+ \
#geom_bar(stat= "identity")+ ggtitle("Age_group and Travel_with")

Solution 4¶

From the descriptive summary earlier in this project, the mean of night_mainland is 8.488043, so a tourist spends about 8 nights on average on the Tanzania mainland.

Solution 5¶

From the descriptive summary earlier in this project, the mean of night_zanzibar is 2.304429, so a tourist spends about 2 nights on average in Zanzibar.
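
The two averages can also be computed directly instead of being read from the summary table. A minimal sketch using only the five preview rows shown by `df.head()`, so the numbers below differ from the full-data means quoted above. The median is included because both night columns have long right tails.

```python
import pandas as pd

# The five preview rows from df.head(), not the full dataset
toy = pd.DataFrame({
    "night_mainland": [13, 14, 1, 11, 7],
    "night_zanzibar": [0, 7, 31, 0, 4],
})

avg_nights = toy[["night_mainland", "night_zanzibar"]].agg(["mean", "median"])
print(avg_nights)
```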

Solution 6¶

In [20]:
sns.countplot(x='payment_mode',data=df,hatch='.')
Out[20]:
<AxesSubplot:xlabel='payment_mode', ylabel='count'>

The payment mode most preferred by tourists is Cash, as the plot above clearly shows.
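
A count plot shows the ordering but not the exact shares. A minimal sketch of getting the proportions with `value_counts(normalize=True)`. Apart from "Cash", the category labels in the toy series are hypothetical.

```python
import pandas as pd

# Toy series; only "Cash" is a label known from the data preview
toy_modes = pd.Series(
    ["Cash", "Cash", "Credit Card", "Cash", "Other"], name="payment_mode"
)

shares = toy_modes.value_counts(normalize=True)  # fraction of rows per category
most_common = shares.idxmax()
print(shares)
```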

Solution 7¶

In [21]:
Activity_dats=df.groupby(['main_activity'], sort=False)['total_cost'].sum().reset_index()
#dats= df.groupby('country').agg({'total_cost':['sum','count']})
print(Activity_dats)
              main_activity    total_cost
0          Wildlife tourism  2.393484e+10
1          Cultural tourism  1.432819e+09
2         Mountain climbing  4.359085e+08
3             Beach tourism  7.712958e+09
4        Conference tourism  3.782597e+09
5           Hunting tourism  8.734764e+08
6             Bird watching  1.560128e+08
7                  business  4.712545e+08
8  Diving and Sport Fishing  2.222264e+08
In [22]:
px.bar(Activity_dats, x = 'main_activity', y = 'total_cost', title = 'Main Activity Statistics', color_discrete_sequence = ['green'])

From the analysis above, the tourism sectors with the highest total spending in Tanzania are “Wildlife tourism”, followed by “Beach tourism”. It is therefore wise and worthwhile to invest in these sectors.

Data Preprocessing¶

  • Missing value treatment (if needed)
  • Feature engineering
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Feature importance
  • scaling

Checking for Missing Values¶

In [23]:
# Preview the data after missing-value treatment
df.head()
Out[23]:
ID country age_group travel_with total_female total_male purpose main_activity info_source tour_arrangement ... package_transport_tz package_sightseeing package_guided_tour package_insurance night_mainland night_zanzibar payment_mode first_trip_tz most_impressing total_cost
0 tour_0 SWIZERLAND 45-64 Friends/Relatives 1.0 1.0 Leisure and Holidays Wildlife tourism Friends, relatives Independent ... No No No No 13.0 0.0 Cash No Friendly People 674602.5
1 tour_10 UNITED KINGDOM 25-44 Alone 1.0 0.0 Leisure and Holidays Cultural tourism others Independent ... No No No No 14.0 7.0 Cash Yes Wonderful Country, Landscape, Nature 3214906.5
2 tour_1000 UNITED KINGDOM 25-44 Alone 0.0 1.0 Visiting Friends and Relatives Cultural tourism Friends, relatives Independent ... No No No No 1.0 31.0 Cash No Excellent Experience 3315000.0
3 tour_1002 UNITED KINGDOM 25-44 Spouse 1.0 1.0 Leisure and Holidays Wildlife tourism Travel, agent, tour operator Package Tour ... Yes Yes Yes No 11.0 0.0 Cash Yes Friendly People 7790250.0
4 tour_1004 CHINA 1-24 Alone 1.0 0.0 Leisure and Holidays Wildlife tourism Travel, agent, tour operator Independent ... No No No No 7.0 4.0 Cash Yes No comments 1657500.0

5 rows × 23 columns

Feature Engineering¶

Counts such as the number of males, the number of females, and the number of nights are whole numbers, so these columns are converted from float to int to reflect the real quantities they represent.

In [24]:
# convert float dtypes to int[total_female,total_male,night_mainland,night_zanzibar]
df["total_female"] = df['total_female'].astype('int')
df["total_male"] = df['total_male'].astype('int')
df["night_mainland"] = df['night_mainland'].astype('int')
df["night_zanzibar"] = df['night_zanzibar'].astype('int')
In [25]:
# Carry out same operation on our Test dataset
# convert float dtypes to int[total_female,total_male,night_mainland,night_zanzibar]
df_test["total_female"] = df_test['total_female'].astype('int')
df_test["total_male"] = df_test['total_male'].astype('int')
df_test["night_mainland"] = df_test['night_mainland'].astype('int')
df_test["night_zanzibar"] = df_test['night_zanzibar'].astype('int')
In [26]:
# Generate new features from some columns
df["total_people"] = df["total_female"] + df["total_male"]

df["total_nights"] = df["night_mainland"] + df["night_zanzibar"]
In [27]:
# Generate new features from some columns on Test Dataset
df_test["total_people"] = df_test["total_female"] + df_test["total_male"]

df_test["total_nights"] = df_test["night_mainland"] + df_test["night_zanzibar"]
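
The conversion and feature-generation cells above are written twice, once per frame. A minimal sketch of wrapping them in a single function so the training and test frames are guaranteed to receive identical steps. The function name `add_trip_features` is hypothetical, and the toy rows are taken from the data preview.

```python
import pandas as pd

COUNT_COLS = ["total_female", "total_male", "night_mainland", "night_zanzibar"]


def add_trip_features(frame):
    """Return a copy with count columns cast to int and two derived totals added."""
    out = frame.astype({col: "int64" for col in COUNT_COLS})
    out["total_people"] = out["total_female"] + out["total_male"]
    out["total_nights"] = out["night_mainland"] + out["night_zanzibar"]
    return out


# First three preview rows of the training data
toy = pd.DataFrame({
    "total_female": [1.0, 1.0, 0.0],
    "total_male": [1.0, 0.0, 1.0],
    "night_mainland": [13.0, 14.0, 1.0],
    "night_zanzibar": [0.0, 7.0, 31.0],
})

engineered = add_trip_features(toy)
print(engineered)
```

The function would be called once on `df` and once on `df_test`. It assumes missing values were filled beforehand, since casting NaN to int raises an error.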
In [28]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     4809 non-null   object 
 1   country                4809 non-null   object 
 2   age_group              4809 non-null   object 
 3   travel_with            4809 non-null   object 
 4   total_female           4809 non-null   int32  
 5   total_male             4809 non-null   int32  
 6   purpose                4809 non-null   object 
 7   main_activity          4809 non-null   object 
 8   info_source            4809 non-null   object 
 9   tour_arrangement       4809 non-null   object 
 10  package_transport_int  4809 non-null   object 
 11  package_accomodation   4809 non-null   object 
 12  package_food           4809 non-null   object 
 13  package_transport_tz   4809 non-null   object 
 14  package_sightseeing    4809 non-null   object 
 15  package_guided_tour    4809 non-null   object 
 16  package_insurance      4809 non-null   object 
 17  night_mainland         4809 non-null   int32  
 18  night_zanzibar         4809 non-null   int32  
 19  payment_mode           4809 non-null   object 
 20  first_trip_tz          4809 non-null   object 
 21  most_impressing        4809 non-null   object 
 22  total_cost             4809 non-null   float64
 23  total_people           4809 non-null   int32  
 24  total_nights           4809 non-null   int32  
dtypes: float64(1), int32(6), object(18)
memory usage: 826.7+ KB

Preparing data for modeling¶

In [29]:
# let's remove ID Column
df.drop('ID', axis='columns', inplace=True)
In [30]:
# Then encode objects into numeric

for colname in df.select_dtypes("object"):
    df[colname],_=df[colname].factorize()
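
`factorize` assigns integer codes in order of first appearance, so a linear model treats an arbitrary ordering as a magnitude, and calling it separately on the test frame could give the same category a different code. A minimal sketch of an alternative, not the approach used in this notebook: one-hot encode with `pd.get_dummies` and align the test columns to the training columns. The toy frames and the "Credit Card" and "Travellers Cheque" labels are hypothetical.

```python
import pandas as pd

# Toy frames (not the survey data)
toy_train = pd.DataFrame({
    "payment_mode": ["Cash", "Credit Card", "Cash"],
    "total_nights": [13, 21, 32],
})
toy_test = pd.DataFrame({
    "payment_mode": ["Credit Card", "Travellers Cheque"],  # second label unseen in train
    "total_nights": [5, 9],
})

train_encoded = pd.get_dummies(toy_train, columns=["payment_mode"], dtype=int)
test_encoded = pd.get_dummies(toy_test, columns=["payment_mode"], dtype=int)

# Give the test frame exactly the training columns: categories missing from
# the test data become all-zero columns, and unseen categories are dropped
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_encoded)
```

With 105 distinct countries, one-hot encoding the country column adds many features, so grouping rare countries first may be worth considering.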
In [31]:
df.columns
Out[31]:
Index(['country', 'age_group', 'travel_with', 'total_female', 'total_male',
       'purpose', 'main_activity', 'info_source', 'tour_arrangement',
       'package_transport_int', 'package_accomodation', 'package_food',
       'package_transport_tz', 'package_sightseeing', 'package_guided_tour',
       'package_insurance', 'night_mainland', 'night_zanzibar', 'payment_mode',
       'first_trip_tz', 'most_impressing', 'total_cost', 'total_people',
       'total_nights'],
      dtype='object')
In [32]:
df.head()
Out[32]:
country age_group travel_with total_female total_male purpose main_activity info_source tour_arrangement package_transport_int ... package_guided_tour package_insurance night_mainland night_zanzibar payment_mode first_trip_tz most_impressing total_cost total_people total_nights
0 0 0 0 1 1 0 0 0 0 0 ... 0 0 13 0 0 0 0 674602.5 2 13
1 1 1 1 1 0 0 1 1 0 0 ... 0 0 14 7 0 1 1 3214906.5 1 21
2 1 1 1 0 1 1 1 0 0 0 ... 0 0 1 31 0 0 2 3315000.0 1 32
3 1 1 2 1 1 0 0 2 1 0 ... 1 0 11 0 0 1 0 7790250.0 2 11
4 2 2 1 1 0 0 0 2 0 0 ... 0 0 7 4 0 1 3 1657500.0 1 11

5 rows × 24 columns

In [33]:
# Splitting dependent and independent features
features_cols = df.drop(columns=["total_cost"])
cols = features_cols.columns
target = df["total_cost"]
In [34]:
df[cols].shape , target.shape
Out[34]:
((4809, 23), (4809,))

EDA¶

  • It is a good idea to explore the data once again after manipulating it.
In [35]:
profile = ProfileReport(df, title="Pandas Profiling Report")
In [36]:
profile.to_widgets()

Building a Regression Model¶

In [37]:
# create training and testing vars
X_train, X_test, y_train, y_test = train_test_split(df[cols],target, test_size=0.20, random_state = 2020)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
(3847, 23) (3847,)
(962, 23) (962,)
In [38]:
# %Model initialization & training
model = LinearRegression().fit(X_train,y_train)

# %Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
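
The metric functions imported at the top of the notebook can be applied to both sets of predictions to compare training and test error. A minimal sketch on synthetic data, so the printed scores say nothing about the tourism model. The helper name `regression_report` and the synthetic coefficients are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def regression_report(y_true, y_pred):
    """Return R^2, MAE and RMSE for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mse)),
    }


# Synthetic regression problem (not the survey data)
rng = np.random.default_rng(2020)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=2020)
toy_model = LinearRegression().fit(X_tr, y_tr)

train_report = regression_report(y_tr, toy_model.predict(X_tr))
test_report = regression_report(y_te, toy_model.predict(X_te))
print("train:", train_report)
print("test: ", test_report)
```

A test error much larger than the training error would point to overfitting; similar but poor scores on both would point to underfitting.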
In [39]:
print(model.intercept_)
print(model.coef_)
-1532685.355596223
[ -24916.7993009   248810.43669093 1070681.41948759  607557.37808411
  452550.05619344 -314383.20232077 -197859.23624628  212878.24997176
  747305.50233752 4694916.68058624 1322140.77706261  616850.95182881
 1651749.77660119 2774593.61306472  406779.60576743  223327.50069718
    9307.49419598   54591.45769385 2567290.08395854  290067.16241359
  -31166.19595691 1060107.43427755   63898.95188983]
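
The raw coefficient array is hard to read without the feature names. A minimal sketch of pairing each coefficient with its column and ordering by absolute size. The toy features and the target formula are hypothetical and chosen so that the fitted values are known in advance.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy features with an exactly linear target (not the survey data)
toy_X = pd.DataFrame({
    "total_people": [1, 2, 3, 4, 5, 6],
    "total_nights": [3, 1, 4, 1, 5, 9],
})
toy_y = 2.0 * toy_X["total_people"] + 0.5 * toy_X["total_nights"] + 1.0

toy_model = LinearRegression().fit(toy_X, toy_y)

# Label each coefficient with its feature, then order by magnitude
coefs = pd.Series(toy_model.coef_, index=toy_X.columns)
coef_table = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coef_table)
print("intercept:", toy_model.intercept_)
```

Coefficient sizes are only comparable across features measured on similar scales, and coefficients on integer-coded categorical columns have no direct interpretation.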
In [40]:
predictions= model.predict(X_test)
predictions
Out[40]:
array([ 8.50050572e+05,  1.24125965e+07,  6.80871403e+06,  1.31156342e+07,
        1.27806337e+04,  1.27276763e+07,  2.88692596e+06,  8.93223640e+06,
        1.06980007e+07, -1.54734078e+05,  8.25862462e+05,  1.66198670e+07,
        1.62435771e+07,  4.82212769e+06,  8.86829431e+06,  5.33223011e+06,
        2.07295499e+06,  4.86418098e+06,  1.36601951e+07,  1.02048605e+07,
        2.31018377e+06,  1.22895274e+06,  2.53573804e+06,  5.85448915e+05,
       -1.29308411e+06,  1.73162990e+07,  1.23436597e+06,  1.29792679e+07,
        2.89214445e+07,  1.23593751e+06,  3.68618178e+05, -9.46148794e+05,
        1.05053299e+06,  3.71959909e+06,  6.60322393e+06, -1.84693771e+06,
        1.21401216e+07,  9.32888116e+05,  1.46410289e+07, -5.51387146e+05,
        3.33073545e+07,  2.61016229e+06,  1.70527451e+07,  1.12436216e+07,
        1.80653126e+07,  2.44474829e+06,  2.25303930e+06, -1.85969483e+06,
        1.26119237e+07,  5.25052396e+06,  1.46497417e+07,  3.87349083e+05,
        4.06874294e+06,  6.62582924e+06,  1.65203097e+06,  1.38778393e+07,
        2.02151122e+07,  1.72334458e+07,  8.63087807e+06,  1.28541823e+07,
        5.23954584e+06,  1.49006141e+07,  1.48648580e+07,  1.64095330e+07,
        3.48333320e+06,  5.55554654e+06,  1.25078753e+07,  3.95041363e+06,
       -1.28183943e+05,  1.13486846e+07,  2.21203922e+06,  1.41998892e+07,
        8.91874550e+06,  1.34514731e+07,  2.40439420e+06,  9.85455539e+04,
        1.31282299e+07, -1.46139902e+04,  1.26317155e+07,  1.05723909e+07,
        3.09082769e+06,  1.84580910e+07,  3.30965016e+06,  2.94732364e+05,
        1.11353212e+07,  9.92121154e+06,  1.24499721e+07,  1.19765105e+07,
        1.79383443e+06,  9.68219503e+05,  9.04215757e+06,  5.05279263e+05,
        1.27666414e+07,  1.97633976e+07,  1.67383199e+07,  8.56021731e+06,
        5.03024236e+06,  1.22105648e+07,  1.27466049e+06,  1.52987476e+07,
        1.79976917e+07, -1.81857369e+06,  1.46274223e+07,  2.10350194e+07,
        1.12253018e+07,  1.47026312e+07,  8.99281000e+06,  2.11563640e+06,
        3.14142637e+05,  9.56759853e+05,  6.70344267e+06,  5.14414365e+05,
        3.31928820e+06,  2.72090325e+06,  2.31941405e+07,  2.23329819e+07,
        4.39788609e+06,  6.47168009e+06,  9.31788528e+06,  4.90442968e+06,
        1.35592363e+06,  1.98709393e+07,  3.17091678e+06,  1.89448771e+07,
       -1.07564003e+05,  1.54137367e+07,  8.88306681e+04,  5.34648821e+06,
        1.98717739e+07, -1.59059225e+06,  4.67945678e+06,  2.77703156e+06,
        1.75333897e+06,  9.47472104e+05,  1.46888689e+06,  1.78820778e+07,
        1.86461621e+07,  9.64175666e+06, -1.61257574e+06,  1.39175803e+07,
        1.03707431e+07,  1.69675995e+07,  1.62818934e+07,  4.72706853e+05,
        6.14238838e+06,  2.43404527e+07,  2.46790367e+06,  1.32872109e+07,
        1.14822189e+06,  7.49448525e+05,  7.02932269e+05,  2.03315308e+06,
        1.01114099e+07,  2.16582120e+07,  1.22340456e+06,  1.56015778e+07,
       -3.58789486e+04,  3.67414833e+06,  2.12613344e+07,  7.93946091e+06,
        2.29104803e+07,  7.72816388e+06,  1.94798726e+07,  1.39618436e+06,
        6.14505486e+06,  5.88734773e+06,  1.51400234e+07,  1.49106335e+06,
        2.06342153e+06,  1.33188000e+06,  3.98505919e+06,  4.35125206e+06,
        2.19186064e+07,  1.05642610e+07,  1.60518678e+07,  4.65151761e+05,
        7.34052611e+05,  5.26422208e+05,  4.96787260e+06,  1.32709909e+07,
        1.71764936e+07,  1.35964782e+07,  1.05272846e+07,  2.12592597e+07,
        6.56551628e+06,  2.02946051e+07,  1.68574151e+07,  2.93700354e+06,
        6.65185959e+04,  7.32042401e+06,  5.22368498e+06,  5.72933220e+06,
        1.33921441e+07,  5.62602565e+05,  1.35634309e+07,  1.62155819e+06,
        1.75304673e+06,  1.02558299e+07,  6.32121022e+06,  1.12092501e+07,
        1.10249104e+06,  5.61685409e+06,  1.50117395e+06,  3.62227926e+06,
        8.08892963e+05,  7.89017824e+06,  7.70156967e+06,  1.52973132e+06,
        3.31785801e+05,  1.60154707e+07,  7.75091740e+06,  2.67678797e+06,
        4.23401521e+06,  2.13611447e+06,  9.49084722e+06,  1.21256810e+07,
        9.10332175e+06,  2.09220053e+06,  2.43284346e+06, -3.90531918e+05,
        3.25847557e+06,  1.64132214e+07,  1.19401934e+07,  1.99966176e+07,
        4.22617178e+06,  1.34860136e+07,  5.97788482e+05,  6.48579514e+06,
        9.66940310e+06,  1.40718057e+07,  1.02882450e+07,  2.10520995e+07,
        3.22355613e+06,  6.53195151e+06,  4.07115220e+06,  1.05902769e+07,
        1.48134053e+07,  1.88398280e+07,  2.74700635e+06,  5.16564935e+05,
        6.61744573e+06,  2.17717955e+06,  1.09626125e+07,  1.80522501e+07,
        1.70592125e+07,  1.38890534e+07,  4.82911104e+06,  1.40738132e+07,
        4.41766901e+06,  9.88677264e+06,  1.61912304e+06,  1.57597256e+07,
        4.26729073e+06,  3.87856772e+06,  1.61004019e+06,  6.55558905e+06,
        3.55779761e+06,  1.34675663e+07,  3.09139127e+06,  4.39545829e+06,
        9.57103642e+06,  2.60293964e+05,  1.28317378e+06,  9.96147587e+05,
        8.94555950e+05,  1.35978627e+07, -1.18863271e+06,  1.66821085e+07,
        3.52112712e+06,  2.69727313e+06, -1.08181357e+06,  6.38878002e+06,
        1.77157099e+07,  5.01339306e+05,  1.03527953e+07,  9.50168614e+06,
        1.44377919e+07,  3.15256041e+06,  1.87387744e+06,  1.58434663e+07,
       -9.78700913e+05,  9.57338892e+06,  2.06116939e+07,  7.98378599e+06,
        1.33209038e+06,  9.44173267e+06,  2.09950019e+06,  8.67961848e+06,
        6.17752899e+06,  1.61040098e+07, -1.17766316e+05,  1.47230461e+07,
        9.91086775e+05,  9.39197054e+06,  1.57116027e+06,  4.55817491e+06,
        1.04084726e+07,  5.33926416e+05,  1.63057895e+05,  1.58147895e+07,
       -3.28704733e+05, -5.48819215e+05,  1.46003033e+07,  1.83631328e+05,
        1.71234287e+07,  9.97720110e+06,  3.14142637e+05,  6.92470554e+06,
        1.20354208e+07,  3.72463808e+06,  2.64757278e+06,  3.33161768e+06,
        3.87468384e+06,  5.03569718e+06,  2.92427622e+06,  6.03652521e+06,
        2.07709163e+06,  6.88715928e+06,  2.77176407e+06,  2.26249830e+06,
        5.37529519e+05,  1.43249684e+07,  9.40038371e+06, -2.70274577e+05,
        2.08116919e+07,  1.88977553e+07,  1.75297125e+07,  1.14975468e+06,
        6.73565786e+06,  1.71079180e+06,  6.12487988e+06,  5.88004769e+06,
        5.92301193e+05, -1.55024334e+06,  9.29603955e+06,  1.61983057e+06,
        1.80081428e+06,  2.74370362e+06,  1.63892610e+07, -2.37501293e+06,
        1.96238016e+07,  1.25799890e+07,  7.96904573e+05,  1.84825116e+07,
        5.06914056e+06,  1.41494464e+07, -8.66211440e+05,  1.46487025e+07,
        1.43028104e+06,  1.18294684e+07,  1.02857491e+07, -1.30143291e+06,
       -1.93853288e+06,  1.82982283e+07,  2.61901301e+06,  3.40245832e+05,
       -5.62237825e+05,  1.45289604e+07,  2.55135948e+06,  2.97002152e+05,
        2.46718028e+07,  5.39227584e+06,  1.42974563e+07,  1.22126896e+07,
       -1.32028006e+05,  1.78626849e+06, -4.55503492e+05,  9.11037222e+06,
        1.51493866e+07,  3.16571587e+06,  1.12033922e+07,  3.34268216e+06,
        1.33490759e+07,  1.04861602e+07,  1.69811037e+07,  1.14172660e+07,
        1.20051231e+06,  4.84208387e+05,  3.30300259e+06,  1.17439336e+07,
        3.48044795e+06,  1.06126623e+07, -5.49774969e+04,  3.52668106e+05,
        1.50560362e+08, -5.80389680e+05,  9.77605021e+06,  3.41580390e+06,
        1.04542689e+07, -2.35803082e+06,  1.40046423e+07,  1.00841316e+07,
        1.05167279e+07,  6.38669093e+06,  8.23478629e+05,  4.76776386e+06,
        1.40550995e+06,  2.16070153e+06,  5.61452994e+06,  7.07208050e+06,
        6.82151150e+06,  2.19076222e+06,  5.03199415e+06,  1.55332417e+06,
        2.96187924e+06,  1.73337084e+07, -1.26289633e+05,  6.87458017e+05,
        1.85438885e+07,  1.25356843e+07,  7.91361806e+06,  2.78366919e+07,
        4.26132812e+06,  1.02690082e+06,  7.92405837e+06, -2.01263389e+06,
        4.07654376e+06, -4.06282912e+06,  7.93061861e+06,  2.29813333e+07,
        8.45246872e+06, -8.21633935e+05,  1.66471246e+07,  1.31928604e+07,
        1.37161001e+07,  1.14875740e+07,  4.11614387e+06,  4.36793558e+06,
        1.37845184e+07, -6.74725899e+05,  1.39399979e+07,  1.31374824e+07,
        1.62979492e+07,  1.08446947e+06,  9.73244930e+06,  2.45929484e+06,
        1.71645967e+07,  6.36111785e+06,  1.77598203e+07,  1.92945215e+06,
        4.16872814e+06,  6.93691478e+04,  5.36866018e+06,  8.27200339e+06,
        8.58777790e+05,  2.33052866e+06,  2.50641062e+07,  2.82297859e+06,
        1.13649723e+07,  1.07523256e+07,  7.29965591e+06,  1.64797726e+07,
        1.09213744e+05,  5.38570995e+06,  1.77317781e+07,  6.13089752e+05,
        4.55731790e+06,  3.53562786e+06,  3.48359599e+06,  2.06355746e+07,
        1.26566070e+07,  2.00385303e+06,  5.77418739e+06,  4.80982168e+06,
        1.01320071e+07,  3.07390831e+05,  7.17515330e+05,  7.82292093e+05,
        2.15050535e+07,  1.61245521e+07,  1.46775352e+06,  1.00812037e+07,
        3.16491698e+07,  2.96761759e+07,  1.30849549e+07,  9.27868318e+06,
        2.35584039e+07,  1.44980839e+07, -2.69799919e+04,  2.30362497e+06,
       -6.33027844e+05,  9.42444954e+06,  1.58368646e+07,  9.54339805e+06,
        1.87606726e+07,  8.90360204e+06,  2.17896018e+06,  9.77133836e+06,
        2.44848660e+07,  1.70093455e+07,  2.78444916e+05,  1.40295462e+07,
        7.00229989e+06, -3.44568003e+05,  1.03316865e+07,  2.34860198e+06,
        1.22520199e+07,  1.88674723e+07,  9.19592910e+06,  3.08322199e+06,
        2.20967071e+06,  1.20788996e+07,  1.50843796e+07,  3.12026780e+07,
       -8.47705104e+05,  3.44121341e+06,  8.36266088e+06,  8.21288514e+05,
        1.38200363e+07,  2.80217643e+06,  1.79488970e+07,  1.54940391e+06,
        8.13298245e+06,  1.06862904e+07,  6.42552275e+06,  1.87299433e+07,
        1.79077592e+07,  1.58031147e+06, -5.45342170e+04,  2.09320552e+07,
        1.43713360e+07,  1.66242222e+07,  1.77073134e+07,  1.36187719e+07,
       -8.10806917e+04,  4.26761816e+06,  6.99539826e+06,  8.51939689e+06,
        1.70342511e+07,  4.06009350e+06,  7.14278200e+05,  9.33094255e+06,
        3.92652676e+06,  8.76529273e+06, -7.18278996e+05,  1.24134123e+07,
        1.04712628e+07,  1.69588122e+07, -3.91936538e+05,  1.42350592e+07,
        8.82048092e+06,  9.72987065e+06,  3.08767756e+06,  1.82902313e+06,
        1.02648088e+06,  9.18249034e+05,  1.82289491e+04,  2.78820412e+07,
       -7.96656835e+05, -1.97286777e+06,  9.95423131e+05,  1.48319472e+05,
        1.77218058e+06,  2.25783286e+04,  4.54150358e+05,  2.47739437e+06,
        7.56275245e+06,  2.17896018e+06,  2.54172365e+06,  1.46692017e+07,
        7.00460088e+06,  1.17201561e+07, -1.67156776e+05,  6.77065736e+05,
        3.72104496e+06,  1.20026575e+07,  1.19417663e+07,  2.11827859e+07,
        3.04749831e+06,  5.58067401e+05,  1.31832972e+07,  7.71549827e+05,
        1.95362315e+07,  1.01229226e+07,  9.69142560e+05,  1.83430645e+07,
        1.23369775e+07,  1.66185608e+07,  9.90152793e+06,  2.17028775e+07,
        1.47831956e+06,  1.60454750e+07, -5.36285443e+04,  9.49230258e+04,
        1.54635542e+07,  1.08502690e+07,  2.44751025e+07,  6.77065736e+05,
       -2.06718013e+06,  2.70701513e+06,  1.18838572e+07,  5.57012916e+05,
        5.43968999e+06,  2.45086500e+06,  1.27103677e+07,  1.57247980e+07,
        6.06753499e+05,  3.11248328e+07,  1.26424215e+06,  1.63432987e+06,
        2.77117580e+06,  3.77390955e+06,  2.36809683e+07,  1.64021289e+07,
       -1.09081642e+06,  1.29668701e+06,  3.11173759e+06,  1.29203521e+06,
        1.58176796e+07,  4.21554694e+06,  1.25324364e+07,  1.11633969e+07,
        5.85814781e+06,  1.78716861e+07,  1.76055439e+06,  5.89250904e+06,
        3.78333867e+05,  8.74287983e+05,  1.80653126e+07, -1.26018155e+05,
        6.49671895e+06, -1.20093253e+05,  1.10606211e+07,  3.69302440e+06,
        1.47992828e+07,  1.93554270e+07,  1.37966231e+07,  6.97557543e+05,
        2.40790672e+06, -7.43730537e+05,  1.02558299e+07,  7.19535684e+06,
        1.45685666e+04,  1.18628887e+06, -5.74321650e+05,  2.53454995e+06,
        4.84312874e+05,  3.89210429e+06,  1.40698078e+07,  9.43439037e+06,
        1.76867666e+06, -2.17844436e+06,  1.40352463e+06,  1.99509631e+07,
        4.03541530e+06,  1.72184746e+07,  4.78577594e+06,  1.95795653e+07,
        6.62760264e+06,  9.58358982e+06,  4.95173722e+06,  6.80785188e+06,
        1.01088467e+07, -8.56312228e+05,  5.47940416e+06,  1.13862457e+07,
        1.24478044e+07,  5.26828006e+06,  1.14584967e+07,  3.94117046e+06,
        2.16466030e+06,  4.34574291e+06,  3.44734609e+06,  1.13476566e+07,
       -1.92688290e+04,  1.32609619e+07,  6.07841941e+06,  4.08268252e+06,
        6.43081242e+06,  4.10013669e+06,  3.16644905e+05,  1.17344761e+06,
        2.34753584e+07,  1.45707400e+07,  3.54180032e+05,  2.09813934e+06,
        3.57527465e+06,  4.16307090e+06,  1.22188913e+07,  2.69958835e+07,
       -1.04427368e+06,  1.02558299e+07,  1.71220541e+07,  4.72659147e+06,
        2.61359900e+06,  4.30888822e+06,  1.10873925e+07, -6.46895223e+05,
        1.03687616e+07,  1.75106013e+06,  2.86715637e+06,  5.00882220e+05,
        3.58891561e+06,  9.86826491e+06,  1.14270943e+07, -1.48476085e+05,
        5.37683765e+05,  6.26092342e+06,  2.47399086e+06,  2.66281259e+06,
        7.55673351e+06,  2.28650973e+06,  7.06572672e+06,  9.27866151e+06,
        2.94063193e+06,  2.84523172e+06,  1.70712970e+07, -9.63902425e+04,
        1.68860401e+07, -2.35788948e+06,  5.34582704e+05,  1.51331575e+07,
        2.21203922e+06,  8.74476287e+06,  1.18530267e+06,  1.22795873e+07,
       -1.96621296e+06,  1.10879115e+07,  1.81034752e+07, -7.16766150e+05,
        1.23190564e+07,  1.39291814e+07,  1.38340261e+07,  6.55111952e+06,
        5.15299536e+06,  2.47848292e+07,  2.45123129e+07,  1.05923401e+07,
        1.50500601e+07,  1.48358095e+07,  1.76149183e+07,  6.50121339e+05,
        1.18292958e+05,  1.37516180e+07,  4.40137665e+06,  5.74620318e+06,
        1.34436973e+07,  1.09028719e+07,  8.51077264e+06,  3.05750403e+07,
        3.09451534e+06,  1.12145624e+07,  3.89020644e+06,  1.75796211e+07,
        8.50898765e+05,  1.11346257e+07,  4.37414524e+06,  9.63343172e+06,
        1.54074627e+07,  2.38710388e+06,  1.64376937e+07,  1.33908448e+07,
        3.11535995e+06, -5.49774969e+04,  6.56283217e+06,  1.14385860e+07,
        2.25878838e+07,  6.48895522e+06,  1.65169482e+07, -1.32202958e+06,
        1.66440366e+06,  1.89956127e+06,  2.32675801e+06,  3.86590635e+06,
        1.45722740e+06,  8.68480831e+05,  2.15233404e+06,  2.06664279e+05,
        2.32237899e+06,  7.46726193e+06,  6.86135752e+05,  3.02530499e+06,
        1.06444658e+07,  5.11870330e+06,  1.47293256e+07,  2.07276913e+07,
        3.08512270e+06,  5.19340782e+06,  1.55771158e+07,  2.35089665e+05,
        1.94487417e+07,  1.83154980e+07,  4.47304608e+06,  2.76544838e+06,
        1.41440876e+06, -1.97888010e+04,  2.10229726e+07, -3.76994380e+05,
        7.03632542e+06,  2.06689872e+07,  1.45377109e+07,  3.57211915e+06,
        2.23855725e+07,  1.66796547e+07,  8.10318118e+05,  1.13048237e+07,
        2.64679967e+07,  2.08475787e+07,  3.26935894e+06,  9.40474916e+06,
       -1.10738959e+06,  1.71573946e+07, -3.41364592e+05,  1.74740582e+07,
        2.88037361e+06,  7.69148311e+06,  3.14289305e+06,  4.39363555e+06,
        1.65787749e+06,  2.44047317e+06,  1.78648733e+06,  1.81872093e+06,
        1.00980317e+07,  1.05717421e+07,  4.75148880e+06,  1.67922450e+07,
        1.37694633e+07,  6.23605796e+06,  1.38615845e+07,  1.25871427e+07,
        6.72684567e+04,  2.48948759e+06,  1.41788958e+07,  1.12390619e+07,
        4.70208384e+06, -3.04283077e+05,  1.15705881e+05,  1.33298658e+07,
        1.57025243e+07,  1.01119411e+07,  7.83493802e+06,  4.10262120e+06,
       -2.03389750e+06, -2.06937187e+05,  6.36582149e+06,  1.79733241e+07,
        3.38638650e+05,  1.37325193e+07,  2.34210552e+07, -1.35315196e+05,
        2.12097794e+06,  1.46817159e+07, -4.63167886e+05,  2.68108404e+06,
        1.32503895e+07,  2.06378210e+07,  6.73191506e+06,  4.79340810e+06,
        5.11593080e+05,  2.12716455e+06, -5.42439429e+05,  1.13119963e+07,
        5.71486589e+06,  9.91519290e+06,  3.78801765e+06, -7.75099417e+05,
        6.19710996e+06,  3.07093812e+06,  1.26415044e+07,  1.74533569e+07,
        9.14632121e+05,  1.95447634e+06, -1.85545606e+05,  1.73191169e+06,
        3.92527111e+05,  2.36953922e+07,  2.61755534e+05,  1.47799797e+07,
        1.59060730e+07,  1.41899427e+07,  4.50361442e+06,  9.52062888e+06,
        1.61969976e+07,  1.37348241e+07,  7.57406567e+06,  2.61653061e+06,
        2.10551437e+07,  2.10081007e+07,  1.67652221e+07,  1.43243543e+07,
        3.27612658e+05,  1.27957887e+07,  1.29086669e+07,  2.69908764e+06,
        5.04470753e+06,  1.01923424e+07,  1.58046860e+07,  1.76968666e+07,
       -2.24077675e+06,  2.93835799e+06,  6.18679127e+06,  2.62271183e+06,
        2.01379546e+07,  1.04022428e+07,  1.26866361e+07,  8.93211356e+06,
        1.71215449e+07,  2.27694709e+06,  1.93314383e+06,  1.49233987e+07,
        1.48342346e+07,  1.05264741e+07,  1.72517121e+07,  1.05470942e+05,
        1.44377919e+07,  2.77063898e+06,  2.57593754e+07,  5.71993416e+06,
        1.86415300e+07,  9.19213000e+06,  5.58350332e+06,  3.15517890e+06,
        1.40658470e+07,  2.42436594e+07,  2.94992571e+05,  4.54575265e+06,
        1.17747222e+07,  1.48252127e+06,  9.87038975e+06,  1.31325332e+07,
       -9.29810403e+05,  1.05284319e+07,  1.06658895e+07,  6.73433779e+05,
        3.70152043e+05,  1.01864755e+04,  6.41159582e+06,  1.38259003e+07,
        3.09169876e+06,  2.20232733e+07,  8.15853461e+06,  1.14385248e+06,
        1.27058165e+06,  8.70389979e+06,  1.25984606e+07,  2.07727381e+07,
        9.60320168e+06,  1.01163465e+07,  1.49692363e+07,  4.37504766e+05,
        1.10151303e+07,  1.13000086e+07,  3.88997067e+06,  8.16158014e+05,
        1.71609537e+07,  1.36497458e+07,  7.96703761e+06,  2.68665635e+06,
        1.00703646e+06,  1.35512498e+06, -7.91689504e+05,  6.91234557e+05,
        4.27194462e+06,  1.42940746e+07,  2.67379377e+06,  1.08065899e+07,
        2.01796812e+06,  4.38913634e+06,  1.74291055e+07,  9.11175291e+06,
        8.57134412e+06, -6.46812071e+05])
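
The prediction array above contains several negative values, which cannot be valid expenditures. A minimal sketch of one possible post-processing step, assuming that clipping at zero is acceptable for the use case; `clip_negative` and `sample_preds` are hypothetical names, and the sample holds only a few values copied from the array above:

```python
import numpy as np

# A few values copied from the prediction array above, including two negatives.
sample_preds = np.array([3.72104496e+06, -5.36285443e+04,
                         -2.06718013e+06, 1.20026575e+07])

def clip_negative(preds):
    """Replace negative predicted costs with 0 (hypothetical helper)."""
    return np.clip(preds, 0, None)

clipped = clip_negative(sample_preds)
print(clipped)
print("negative predictions in sample:", int((sample_preds < 0).sum()))
```

Clipping only hides the symptom. Negative predictions from a linear model may also suggest trying a log-transformed target, which is not something this notebook does.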

Evaluate Model Performances¶

In [41]:
# Evaluation: Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_test_pred)
print('Using scikit-learn, the MAE is {}'.format(mae))
Using scikit-learn, the MAE is 5749113.958510288
In [42]:
# Evaluation: Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_test_pred)
print('Using scikit-learn, the MSE is {}'.format(mse))
Using scikit-learn, the MSE is 118732536896093.78
In [43]:
# Let's use Extreme Gradient Boosting (XGBoost)
from xgboost import XGBRegressor

XGBRegressor performs better than Linear Regression even with its default parameters, as the next cell shows: the MAE falls from about 5.75 million to about 5.39 million, and the MSE from about 1.19e14 to about 1.08e14.

In [44]:
# Instantiate the model with default parameters
XGB = XGBRegressor()

# Train the model
XGB.fit(X_train, y_train)

# Predict on the train and test sets
y_train_pred = XGB.predict(X_train)
y_test_pred = XGB.predict(X_test)

# Evaluate on the test set
mae = mean_absolute_error(y_test, y_test_pred)
print('Using scikit-learn, the MAE is {}'.format(mae))

mse = mean_squared_error(y_test, y_test_pred)
print('Using scikit-learn, the MSE is {}'.format(mse))
Using scikit-learn, the MAE is 5391249.288207617
Using scikit-learn, the MSE is 107951570443229.86
In [45]:
# Performance of XGBoost with explicitly set hyperparameters
X_train, X_test, y_train, y_test = train_test_split(df[cols], target, test_size=0.20, random_state=2020)

# Instantiate the model
XGB_par = XGBRegressor(n_estimators=100, colsample_bynode=0.8, learning_rate=0.02, max_depth=7)

# Train the model
XGB_par.fit(X_train, y_train)

# Predict on the train and test sets
y_train_pred = XGB_par.predict(X_train)
y_test_pred = XGB_par.predict(X_test)

# Evaluate on the test set
mae = mean_absolute_error(y_test, y_test_pred)
print('Using scikit-learn, the MAE is {}'.format(mae))

mse = mean_squared_error(y_test, y_test_pred)
print('Using scikit-learn, the MSE is {}'.format(mse))
Using scikit-learn, the MAE is 4800634.196620062
Using scikit-learn, the MSE is 97851770809151.97
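
MSE is expressed in squared units of the target, so it is hard to read next to MAE. The sketch below puts the three results printed above side by side, converting MSE to RMSE (its square root) and reporting the MAE change relative to Linear Regression. The numbers are copied from the cell outputs; `results` and `summarize` are hypothetical names introduced for illustration. The comparison is only fair if all three models were scored on the same train/test split, which is an assumption here because the last cell calls train_test_split again.

```python
import math

# (MAE, MSE) pairs copied from the cell outputs above.
results = {
    "LinearRegression": (5749113.958510288, 118732536896093.78),
    "XGBRegressor (default)": (5391249.288207617, 107951570443229.86),
    "XGBRegressor (tuned)": (4800634.196620062, 97851770809151.97),
}

def summarize(results, baseline="LinearRegression"):
    """Return (name, MAE, RMSE, % MAE reduction vs. baseline) per model."""
    base_mae = results[baseline][0]
    rows = []
    for name, (mae, mse) in results.items():
        rmse = math.sqrt(mse)  # back to the units of the target
        mae_gain = 100.0 * (base_mae - mae) / base_mae
        rows.append((name, mae, rmse, mae_gain))
    return rows

for name, mae, rmse, gain in summarize(results):
    print(f"{name:<24} MAE={mae:>14,.0f}  RMSE={rmse:>14,.0f}  "
          f"MAE reduction={gain:5.1f}%")
```

In every row the RMSE is roughly twice the MAE, which is what one would expect when a small number of very large errors dominate the squared loss.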

Testing Linear Model Assumptions¶

Preparing Test data¶

In [46]:
df_test.head()
Out[46]:
ID country age_group travel_with total_female total_male purpose main_activity info_source tour_arrangement ... package_sightseeing package_guided_tour package_insurance night_mainland night_zanzibar payment_mode first_trip_tz most_impressing total_people total_nights
0 tour_1 AUSTRALIA 45-64 Spouse 1 1 Leisure and Holidays Wildlife tourism Travel, agent, tour operator Package Tour ... Yes Yes Yes 10 3 Cash Yes Wildlife 2 13
1 tour_100 SOUTH AFRICA 25-44 Friends/Relatives 0 4 Business Wildlife tourism Tanzania Mission Abroad Package Tour ... No No No 13 0 Cash No Wonderful Country, Landscape, Nature 4 13
2 tour_1001 GERMANY 25-44 Friends/Relatives 3 0 Leisure and Holidays Beach tourism Friends, relatives Independent ... No No No 7 14 Cash No No comments 3 21
3 tour_1006 CANADA 24-Jan Friends/Relatives 2 0 Leisure and Holidays Cultural tourism others Independent ... No No No 0 4 Cash Yes Friendly People 2 4
4 tour_1009 UNITED KINGDOM 45-64 Friends/Relatives 2 2 Leisure and Holidays Wildlife tourism Friends, relatives Package Tour ... No No No 10 0 Cash Yes Friendly People 4 10

5 rows × 24 columns

In [47]:
# let's remove ID Column
df_test.drop('ID', axis='columns', inplace=True)
In [48]:
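# Then encode object (categorical) columns as integer codes.
# factorize() numbers the categories in order of first appearance in df_test.

for colname in df_test.select_dtypes("object"):
    df_test[colname], _ = df_test[colname].factorize()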
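
Because factorize() assigns codes in order of first appearance, factorizing the test set on its own can give a category a different integer than it received in the training set. The sketch below shows one way to reuse the training categories on the test data. It is an illustration on toy frames, not the notebook's method; `encode_like_train` and the toy data are hypothetical, and categories not seen in training are mapped to -1.

```python
import pandas as pd

def encode_like_train(train, test):
    """Integer-encode non-numeric columns, reusing the train categories for test."""
    train_enc, test_enc = train.copy(), test.copy()
    for col in train.select_dtypes(exclude="number").columns:
        # Learn the code -> category mapping from the training data only.
        codes, uniques = pd.factorize(train[col])
        train_enc[col] = codes
        # Apply the same mapping to test; unseen categories get code -1.
        test_enc[col] = pd.Categorical(test[col], categories=uniques).codes
    return train_enc, test_enc

# Toy data (hypothetical values, for illustration only).
toy_train = pd.DataFrame({"country": ["USA", "ITALY", "USA", "FRANCE"],
                          "total_nights": [13, 4, 21, 10]})
toy_test = pd.DataFrame({"country": ["ITALY", "KENYA", "USA"],
                         "total_nights": [7, 3, 14]})

train_enc, test_enc = encode_like_train(toy_train, toy_test)
print(train_enc)
print(test_enc)
```

With this approach "ITALY" receives the same code in both frames, whereas factorizing toy_test alone would assign it code 0.
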
In [49]:
df_test.head()
Out[49]:
country age_group travel_with total_female total_male purpose main_activity info_source tour_arrangement package_transport_int ... package_sightseeing package_guided_tour package_insurance night_mainland night_zanzibar payment_mode first_trip_tz most_impressing total_people total_nights
0 0 0 0 1 1 0 0 0 0 0 ... 0 0 0 10 3 0 0 0 2 13
1 1 1 1 0 4 1 0 1 0 0 ... 1 1 1 13 0 0 1 1 4 13
2 2 1 1 3 0 0 1 2 1 1 ... 1 1 1 7 14 0 1 2 3 21
3 3 2 1 2 0 0 2 3 1 1 ... 1 1 1 0 4 0 0 3 2 4
4 4 0 1 2 2 0 0 2 0 0 ... 1 1 1 10 0 0 0 3 4 10

5 rows × 23 columns

In [50]:
model.fit(df[cols],target)
Out[50]:
LinearRegression()
In [51]:
preds2 = model.predict(df_test)
In [52]:
preds2
Out[52]:
array([ 2941140.7776178, 10778644.7776178, 19337428.7776178, ...,
       13447932.7776178, 15230956.7776178, 15413460.7776178])
In [53]:
# Let's add the predicted total cost to the test dataset
Final_df = Final_df.assign(predicted_price=preds2)
In [54]:
Final_df.head()
Out[54]:
ID country age_group travel_with total_female total_male purpose main_activity info_source tour_arrangement ... package_transport_tz package_sightseeing package_guided_tour package_insurance night_mainland night_zanzibar payment_mode first_trip_tz most_impressing predicted_price
0 tour_1 AUSTRALIA 45-64 Spouse 1.0 1.0 Leisure and Holidays Wildlife tourism Travel, agent, tour operator Package Tour ... Yes Yes Yes Yes 10 3 Cash Yes Wildlife 2.941141e+06
1 tour_100 SOUTH AFRICA 25-44 Friends/Relatives 0.0 4.0 Business Wildlife tourism Tanzania Mission Abroad Package Tour ... No No No No 13 0 Cash No Wonderful Country, Landscape, Nature 1.077864e+07
2 tour_1001 GERMANY 25-44 Friends/Relatives 3.0 0.0 Leisure and Holidays Beach tourism Friends, relatives Independent ... No No No No 7 14 Cash No No comments 1.933743e+07
3 tour_1006 CANADA 24-Jan Friends/Relatives 2.0 0.0 Leisure and Holidays Cultural tourism others Independent ... No No No No 0 4 Cash Yes Friendly People 1.579848e+07
4 tour_1009 UNITED KINGDOM 45-64 Friends/Relatives 2.0 2.0 Leisure and Holidays Wildlife tourism Friends, relatives Package Tour ... Yes No No No 10 0 Cash Yes Friendly People 8.607109e+06

5 rows × 23 columns

Actionable Insights and Recommendations¶

In [55]:
# Residual plot: residuals (predicted - actual) against predicted values.
# y_train_pred and y_test_pred were last assigned by XGB_par in In [45].
plt.figure(figsize=(15, 7))

plt.scatter(y_train_pred, y_train_pred - y_train,
            c='black', marker='o', s=50, alpha=0.5,
            label='Train data')
plt.scatter(y_test_pred, y_test_pred - y_test,
            c='c', marker='o', s=50, alpha=0.7,
            label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper right')
plt.show()

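The residual plot above gives a satisfactory result. In a well-behaved fit the residuals scatter around zero with no visible pattern, and the test residuals fall in the same range as the train residuals.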

CONCLUSION¶

The most profitable tourism segments in Tanzania are “Wildlife tourism”, followed by “Beach tourism”, so it would be worthwhile for investors to focus on them. Tourists travelling with friends or relatives spent the most, followed by those travelling with a spouse and children, so facilities should be developed to suit these groups. Tourists below 65 years old spend more, so it is worthwhile to encourage this age group to visit Tanzania. The most profitable countries of origin include the USA, the United Kingdom, Italy, France and Australia.